# Data Visualization

Data visualization involves the creation and study of the visual representation of data. To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics, and other tools. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable, and usable.
ggplot2 is one of the most popular packages for data visualization in R. Created by Hadley Wickham, ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics, a general scheme for data visualization that breaks graphs up into semantic components such as scales and layers. ggplot2 can serve as a replacement for the base graphics in R.
In this notebook we will focus on ggplot2 and try to cover its main functionality. If you have followed the Setup notebook, you should already have ggplot2 installed, since it is part of the tidyverse. Before using a package, we first need to load it into the environment:
library(ggplot2)
Alternatively one can load the entire tidyverse to include all of its core packages:
# loading libraries
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# modifying chart size
options(repr.plot.width=5, repr.plot.height=3)
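The eight tidyverse packages listed above are used in almost every data analysis in R. The conflict messages, such as dplyr::filter() masking stats::filter(), and the masking notes printed when plotly is attached can be ignored for now.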
### First graph

Let’s use one of the datasets that come with ggplot2 to make a graph and answer a simple question.
Question: Do cars with big engines use more fuel than cars with small engines?
dataset: mpg - contains fuel economy data from 1999 and 2008 for 38 popular models of car.
Another way of calling a dataset that comes from a package is to first specify the package name followed by two colons, e.g.: ggplot2::mpg. This is usually optional, but a good practice for extra clarity if needed.
To see the contents of mpg, simply type mpg in the console (or a notebook cell). To avoid printing too many rows in the notebook, I use the head() function from base R to show only the first few rows. If you are using RStudio, typing mpg alone prints just enough output to fit your screen because mpg is a tibble; we will learn more about tibbles later.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
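As a small complement to head(), the sketch below checks the overall size of the dataset, using the ggplot2::mpg form mentioned earlier. It assumes only that ggplot2 is installed; the variable names mpg_data and mpg_dim are illustrative.

```r
# Access mpg through the package namespace, without attaching ggplot2
mpg_data <- ggplot2::mpg

# Number of rows (observations) and columns (variables)
mpg_dim <- dim(mpg_data)
print(mpg_dim)

# Column names, to see which variables are available for plotting
print(names(mpg_data))
```

The first element of mpg_dim is the number of observations and the second is the number of variables shown in the tibble header.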
Among the variables in mpg are:
* displ, a car’s engine size, in litres.
* hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
To open the help page and find out more about this data frame, run ?mpg or press F1. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
To make a plot from the mpg dataset, run this code, which puts displ on the x-axis and hwy on the y-axis:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with big engines use more fuel.
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. ggplot(data = mpg) creates an empty graph.
You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this notebook.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot() looks for the mapped variable in the data argument, in this case, mpg.
### The grammar of graphics

As mentioned above, ggplot2 breaks graphs up into semantic components such as scales and layers. For instance, the graph we just plotted consists of three layers: data, aesthetics, and geometries. These three layers are the minimum required to visualize data.
Data - The source of information to be plotted.
> ggplot(data = mpg)
This statement by itself would result in an empty canvas. In order to show the points we need the other two layers.
Aesthetics - Specify the attributes of the plot. The aesthetic mapping describes how variables in the data are mapped to visual properties of the geometric objects. The mapping argument is always paired with aes().
> mapping = aes(x = displ, y = hwy)

Geometries - Specify the geometric object used to draw the data. In the example above we used geom_point() to show the data points in a scatterplot.
> ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

We can use the following template, which represents these three layers. Replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings:
> ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
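To make the template concrete, here is a minimal sketch that fills it in with a different pair of variables (cty and hwy, both columns of mpg). Reading the layers element of the plot object is only an illustration and assumes the object exposes it this way; the variable names are hypothetical.

```r
library(ggplot2)

# Data layer only: a valid plot object, but no points are drawn yet
base_plot <- ggplot(data = mpg)

# Template filled in: <DATA> = mpg, <GEOM_FUNCTION> = geom_point,
# <MAPPINGS> = x = cty, y = hwy (city vs. highway mileage)
scatter_plot <- base_plot +
  geom_point(mapping = aes(x = cty, y = hwy))

# Each geom function adds one entry to the plot's list of layers
n_layers_base <- length(base_plot$layers)
n_layers_scatter <- length(scatter_plot$layers)
print(c(base = n_layers_base, scatter = n_layers_scatter))
```

Storing the data layer in a variable and adding geoms to it later is a convenient way to try several geoms on the same dataset.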
Facets - Allow us to visualize multiple groups of the same data within one canvas. For example, we could group the cars by their drv:
* 4: Four-wheel drive
* f: Front-wheel drive
* r: Rear-wheel drive
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ drv, nrow = 2)
Statistics - This layer summarizes or transforms the data before plotting it. For instance, we can fit a smooth line to the previous plot with geom_smooth():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ drv, nrow = 2) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
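Coordinates - Define the space in which the data is plotted. For instance, coord_cartesian() can zoom in on engine sizes between 3 and 6 litres: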
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ drv, nrow = 2) +
geom_smooth(mapping = aes(x = displ, y = hwy)) +
coord_cartesian(xlim = c(3, 6))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
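One design point is worth a short sketch: coord_cartesian() only zooms the view, whereas setting limits on the x scale removes out-of-range observations before anything is drawn or any statistic is computed. The example below contrasts the two; the variable names are illustrative.

```r
library(ggplot2)

# Zoom only: all observations are kept, the view is restricted to 3..6
zoomed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(3, 6))

# Scale limits: observations outside 3..6 are treated as missing and
# dropped when the plot is drawn (ggplot2 warns about removed rows)
clipped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(3, 6))

# How many cars fall inside the window, out of all observations
n_inside <- sum(mpg$displ >= 3 & mpg$displ <= 6)
n_total <- nrow(mpg)
print(c(inside = n_inside, total = n_total))
```

When a smoother is added, this difference matters: with coord_cartesian() the curve is still fitted to all the data, while with scale limits it is fitted only to the observations inside the window.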
Theme - Controls the non-data appearance of the plot. There are two ways to set the theme:
1) for each individual plot: add a theme layer (example below)
2) for all the plots in the script: call theme_set() and pass it the theme to use globally.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ drv, nrow = 2) +
geom_smooth(mapping = aes(x = displ, y = hwy)) +
coord_cartesian(xlim = c(3, 6)) +
theme_dark()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
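The second option, a global theme, could look like the sketch below. It relies on theme_set() returning the previously active theme, which makes it easy to restore the earlier setting; the variable names are illustrative.

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be restored later
previous_theme <- theme_set(theme_dark())

# No theme layer here: the global theme is applied when the plot is printed,
# so printing this plot now would draw it with theme_dark()
scatter_no_theme <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Restore the earlier global theme so later plots are unaffected
theme_set(previous_theme)
```

Because the global theme is looked up at print time, a plot without its own theme layer follows whichever theme is active when it is drawn.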
Install the ggthemes package for additional themes.
What does the drv variable describe? Read the help for ?mpg to find out.
# Your answer goes here
Now that we have a general idea of all seven layers of ggplot2, let’s take a deeper dive into some of them.
## Aesthetics

Looking back at the first graph we made, we see a few car models that are relatively efficient (~25 mpg) despite their large engines and seem to fall outside the linear trend (blue circles). How can we explain these cars?
We might be able to explain them by their class; after all, not all large-engine vehicles are SUVs. Let’s use the class column to color the points. To do this we need to map this variable to an aesthetic. An aesthetic is a visual property of the objects in our plot. Aesthetics include things like the size, shape, or color of our points.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
The colors reveal that many of the unusual points are two-seater cars. These sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage.
It is convenient that ggplot2 picks colors for us automatically, but we can also control the color scale ourselves:
# Change range of hues used
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
scale_color_hue(h = c(0, 100))
Or we could set them manually using hex color codes:
# Set manually
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
scale_color_manual(values = c("#f44242", "#f47741", "#f4c441", "#dff441", "#82f441", "#41f4e8", "#419af4"))
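A manual scale also accepts a named vector, which ties each color to a specific level instead of relying on the order of the values. The sketch below uses this to highlight the two-seaters discussed above and grey out every other class; the color choices and variable names are illustrative.

```r
library(ggplot2)

# One entry per vehicle class, all grey to start with
class_levels <- sort(unique(mpg$class))
class_colors <- setNames(rep("grey60", length(class_levels)), class_levels)

# Highlight the two-seaters with one of the hex colors used above
class_colors["2seater"] <- "#f44242"

highlight_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  scale_color_manual(values = class_colors)
```

Naming the values keeps the mapping stable even if the set or order of classes in the data changes.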
You can also set the aesthetic properties of all points together. For example, we can make all of the points in our plot blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "#f44242") # hex coloring
What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
# Your answer goes here
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
# Your answer goes here
What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
# Your answer goes here
## Facets layer

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be discrete.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
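If you would rather not facet in the rows or the columns dimension, a . can take the place of a variable name in the facet_grid() formula. A minimal sketch, with illustrative variable names:

```r
library(ggplot2)

# Facet in columns only: "." means "no variable" for the rows dimension
cols_only <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

# Facet in rows only
rows_only <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

# Number of panels each version produces
n_col_panels <- length(unique(mpg$cyl))
n_row_panels <- length(unique(mpg$drv))
print(c(columns = n_col_panels, rows = n_row_panels))
```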
## Geometries layer

How are these two plots similar?
Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data. To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use this code:
Left:
> ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

Right:
> ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
Now we can output both of these layers on top of each other:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
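The code above repeats the same mapping in both layers. As a sketch of an alternative, the mapping can be passed to ggplot() once, where every layer inherits it; an individual layer can still add its own aesthetics. The variable names are illustrative.

```r
library(ggplot2)

# The mapping given to ggplot() is inherited by both layers
combined <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# A layer can still add or override aesthetics locally:
# only the points are colored by class, the smooth line is not split
combined_colored <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
```

Keeping shared mappings in ggplot() means a change of variable only has to be made in one place.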
If you look at the help page for geom_smooth(), you will see method = "auto" as the default. method is the smoothing method; because the default is "auto", ggplot2 picked loess (LOcal regrESSion) for this dataset. We can change it to linear regression, for instance, with method = "glm":
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy), method = "glm")
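A straight-line fit can also be requested with method = "lm", and se = FALSE hides the confidence band. The sketch below additionally fits the same model with lm() from base R to read off the slope; the variable names are illustrative.

```r
library(ggplot2)

# Straight-line fit drawn without the confidence band
linear_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# The same model fitted directly, to inspect the coefficients
linear_fit <- lm(hwy ~ displ, data = mpg)
displ_slope <- coef(linear_fit)[["displ"]]
print(displ_slope)
```

A negative slope here matches the negative relationship between engine size and highway mileage seen in the first graph.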
Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom.
A few examples:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv, linetype = drv)
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What does show.legend = FALSE do? Show with an example
# Your answer goes here
Recreate the R code necessary to generate the following graphs:
# Your answer goes here
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
# Your answer goes here
## Position adjustments

You can color a bar chart using either the color aesthetic, or, more usefully, fill:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked.
# Fill with clarity
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
# Custom color
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity)) +
scale_fill_manual(values = c("#41f4f4", "#41d9f4", "#41bbf4", "#418ef4", "#415ef4", "#6a41f4", "#9741f4", "#f441f1"))
The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".
position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we can make the bars slightly transparent by setting alpha to a small value:
# position = "identity"
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups:
# position = "fill"
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
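The proportions that position = "fill" displays can also be computed as a table, which is handy for checking the chart. A minimal sketch using base R's table() and prop.table(); the variable names are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Counts of diamonds for every cut/clarity combination
cut_clarity_counts <- table(diamonds$cut, diamonds$clarity)

# Share of each clarity within a cut: every row sums to 1,
# just as every filled bar has the same height
cut_clarity_share <- prop.table(cut_clarity_counts, margin = 1)
print(round(cut_clarity_share, 2))
```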
position = "dodge" places overlapping objects directly beside one another:
# position = "dodge"
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") +
scale_fill_manual(values = c("#41f4f4", "#41d9f4", "#41bbf4", "#418ef4", "#415ef4", "#6a41f4", "#9741f4", "#f441f1"))
A histogram approximates the distribution of a numerical variable by dividing its range into bins and counting the observations in each bin.
ggplot(diamonds, aes(price)) +
geom_histogram(binwidth = 500, fill = "black", color = "white")
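To see the numbers behind the bars, the binned data can be pulled out of the plot with ggplot_build(). This is only a sketch for inspection; it assumes the built plot exposes per-layer data frames with xmin, xmax, and count columns, and the variable names are illustrative.

```r
library(ggplot2)

price_hist <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 500, fill = "black", color = "white")

# ggplot_build() exposes the data computed by the stat for each layer
hist_bins <- ggplot_build(price_hist)$data[[1]]
print(head(hist_bins[, c("xmin", "xmax", "count")]))

# Every diamond lands in exactly one bin
total_counted <- sum(hist_bins$count)
print(total_counted)
```

Changing binwidth changes how many rows hist_bins has, but the counts still add up to the number of observations.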
### Jitter

There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots. If you recall our very first plot, we were looking at a dataset with 234 observations, but in fact the chart shows only 126 points:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other. This problem is known as overplotting. To show all the points we can add position = "jitter" to the geom function, which adds a small amount of random noise to each point and spreads the points out so that overlapping ones become visible.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
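The extent of the overplotting described above can be checked by counting how many distinct (displ, hwy) positions the data contains. A minimal sketch, with illustrative variable names:

```r
library(ggplot2)

# Distinct (displ, hwy) combinations: without jitter, each one is drawn
# as a single visible point no matter how many cars share it
distinct_positions <- unique(mpg[, c("displ", "hwy")])
n_positions <- nrow(distinct_positions)
n_observations <- nrow(mpg)

print(c(observations = n_observations, positions = n_positions))
```

The gap between the two numbers is the number of observations hidden behind other points in the unjittered scatterplot.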
We can also add some transparency with alpha to make overlapping points easier to see:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter", alpha = 1/2)
To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.
Use position_jitter() to modify the amount of jittering. What parameters to geom_jitter() control the amount of jittering?
We haven’t plotted a boxplot yet. Check out the documentation for geom_boxplot() and look at some of the examples. What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.
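Adding position adjustments, stats, coordinate systems, and faceting to the earlier template gives:
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>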
In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.
Bar Chart from Top 50 ggplot2 Visualizations.
We saw how to create a bar chart with geom_bar(). By default geom_bar() uses stat = "count", so we don’t need to provide y; it is calculated as the number of observations in each group. To create a bar chart from a given y value, we need to set stat = "identity" and provide both x and y inside aes(), where x is either a character or a factor and y is numeric.
# create a frequency table
freqtable <- table(mpg$manufacturer)
df <- as.data.frame.table(freqtable)
head(df)
## Var1 Freq
## 1 audi 18
## 2 chevrolet 19
## 3 dodge 37
## 4 ford 25
## 5 honda 9
## 6 hyundai 14
#theme_set(theme_classic())
# Plot
ggplot(df, aes(Var1, Freq)) +
geom_bar(stat="identity", width = 0.5, fill="tomato2") +
labs(title="Bar Chart",
subtitle="Manufacturer of vehicles",
caption="Source: Frequency of Manufacturers from 'mpg' dataset") +
xlab("Manufacturer") +
theme_classic() +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) # to give x labels an angle for readability
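ggplot2 also provides geom_col(), which draws bars from a given y value without having to set stat = "identity". The sketch below rebuilds the frequency table with descriptive column names and draws the same kind of chart; the names manufacturer_counts, manufacturer, and n are illustrative.

```r
library(ggplot2)

# Frequency table as a data frame, with descriptive column names
manufacturer_counts <- as.data.frame(table(mpg$manufacturer))
names(manufacturer_counts) <- c("manufacturer", "n")

# geom_col() uses the y values as bar heights directly
col_plot <- ggplot(manufacturer_counts, aes(x = manufacturer, y = n)) +
  geom_col(width = 0.5, fill = "tomato2") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
```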
#### Time Series

Using geom_line(), a time series (or line chart) can be drawn. Data: economics from ggplot2.
head(economics)
## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <int> <dbl> <dbl> <int>
## 1 1967-07-01 507. 198712 12.5 4.5 2944
## 2 1967-08-01 510. 198911 12.5 4.7 2945
## 3 1967-09-01 516. 199113 11.7 4.6 2958
## 4 1967-10-01 513. 199311 12.5 4.9 3143
## 5 1967-11-01 518. 199498 12.5 4.7 3066
## 6 1967-12-01 526. 199657 12.1 4.8 3018
# Allow Default X Axis Labels
ggplot(economics, aes(x=date)) +
geom_line(aes(y=psavert)) +
labs(title="US economic time series",
subtitle = "Personal Savings Rate",
caption="Source: Economics",
y="Savings Rate %") +
theme_classic()
Email Campaign Funnel from Top 50 ggplot2 Visualizations:
options(scipen = 999) # turns off scientific notations like 1e+40
options(repr.plot.width=7, repr.plot.height=5) # Modifying the chart size
# Read data
options(readr.num_columns = 0) # turns off messages printed by read_csv
email_campaign_funnel <- read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
# X Axis Breaks and Labels
brks <- seq(-15000000, 15000000, 5000000)
lbls <- paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")
# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) + # Fill column
geom_bar(stat = "identity", width = .6) + # draw the bars
scale_y_continuous(breaks = brks, # Breaks
labels = lbls) + # Labels
coord_flip() + # Flip axes
labs(title = "Email Campaign Funnel") +
theme_tufte() + # Tufte theme from ggthemes
theme(plot.title = element_text(hjust = .5), # Centre plot title
axis.ticks = element_blank()) # Remove axis ticks
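The axis labels in the funnel chart mirror the breaks on both sides of zero. As a sketch of an alternative, the labels can be derived from the breaks themselves, so the two cannot drift apart if the breaks change; lbls_from_brks is an illustrative name.

```r
# Same breaks as in the funnel chart above
brks <- seq(-15000000, 15000000, 5000000)

# Labels computed from the breaks: absolute value in millions plus an "m" suffix
lbls_from_brks <- paste0(abs(brks) / 1e6, "m")
print(lbls_from_brks)
```

The result can be passed to the labels argument of scale_y_continuous() in place of lbls.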
### ggthemes

First let’s look at a simple scatterplot made by geom_point() and with no themes:
options(repr.plot.width=5, repr.plot.height=3) # Modifying the chart size, back to the regular size
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
ggtitle("Cars")
p
p2 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) +
geom_point() +
ggtitle("Cars")
Minimal theme and geoms based on plots in The Visual Display of Quantitative Information.
p + geom_rangeframe() +
theme_tufte()
A theme that approximates the style of plots in The Economist magazine.
p + theme_economist() + scale_colour_economist()
For that classic ugly look and feel
p2 + theme_excel() + scale_colour_excel()
Theme and some color palettes based on plots in The Wall Street Journal.
p2 + theme_wsj() + scale_colour_wsj("colors6", "")
# Interactive plotting with Plotly

We can use the plotly package on top of a ggplot2 plot to create interactive charts. Plotly is a powerful tool for creating interactive dashboards and plots, and there are different ways to use it. Here we will only show how to convert ggplot2 plots into plotly charts with the ggplotly() function. For more information about other ways to use this package, go to plot.ly.
Let’s use the most recent plot we created with the WSJ theme as an example:
p2 + theme_wsj() + scale_colour_wsj("colors6", "")
ggplotly(p2 + theme_wsj() + scale_colour_wsj("colors6", "")) # Same plot with ggplotly()
## Warning: plotly.js does not (yet) support horizontal legend items
## You can track progress here:
## https://github.com/plotly/plotly.js/issues/53
ts_plot <- ggplot(economics, aes(x=date)) +
geom_line(aes(y=psavert)) +
labs(title="US economic time series",
subtitle = "Personal Savings Rate",
caption="Source: Economics",
y="Savings Rate %") +
theme_classic()
ts_plot
ggplotly(ts_plot) # Same plot with ggplotly()
bar_plot <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") +
scale_fill_manual(values = c("#41f4f4", "#41d9f4", "#41bbf4", "#418ef4", "#415ef4", "#6a41f4", "#9741f4", "#f441f1"))
bar_plot
ggplotly(bar_plot) # Same plot with ggplotly()
# Additional Resources

* Examples of elaborate charts: Top 50 ggplot2 Visualizations
* To go beyond ggplot2 functionalities check out these extensions: ggplot2 extensions
* ggplot2-cheatsheet.pdf in the cheatsheets directory
* Hex color
  * Simply google “hex color picker” and use Google’s tool
  * There are many other online sources; a highly customizable tool can be found in Mozilla
* Themes
  * ggthemes - Examples here
  * ggtech